The code provided is an analysis of airline safety data. The dataset used in the analysis is the “airline-safety.csv” dataset from FiveThirtyEight’s GitHub repository.
The code performs the following tasks:
Creates a bar plot of the number of fatal accidents for each airline from 1985 to 1999.
Creates a scatter plot to explore the relationship between the available seat kilometers per week and the number of incidents from 2000 to 2014.
Creates a histogram to visualize the distribution of the number of fatalities from 2000 to 2014.
Applies simple random, systematic, stratified, and cluster sampling to draw samples from the dataset.
Computes the incident rate for each airline, sorts the airlines by incident rate, and displays the 10 airlines with the highest rates.
Creates two interactive scatter plots of incidents against fatalities for each airline, one for 1985 to 1999 and one for 2000 to 2014.

Overall, the analysis provides a glimpse of airline safety trends from 1985 to 2014, explores the relationship between different variables, and uses several sampling techniques to illustrate how samples can be drawn from the dataset. The interactive scatter plots provide an engaging way to explore the data.
https://github.com/fivethirtyeight/data/blob/master/airline-safety/airline-safety.csv
library(dplyr)
library(magrittr)
library(rsample)
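library(plotly)

# Load the FiveThirtyEight airline safety data
airline_safety <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv")

# Bar plot of fatal accidents (1985-1999) by airline
par(mar = c(4, 4, 2, 2))
barplot(airline_safety$fatal_accidents_85_99,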
names.arg = airline_safety$airline,
xlab = "Airline",
ylab = "Number of Fatal Accidents (1985-1999)",
main = "Fatal Accidents (1985-1999) by Airline",
col = "darkblue")

# Scatter plot of available seat kilometers against incidents (2000-2014)
plot(airline_safety$avail_seat_km_per_week,
airline_safety$incidents_00_14,
xlab = "Available Seat Kilometers per Week",
ylab = "Number of Incidents (2000-2014)",
main = "Relationship between Available Seat Kilometers per Week and Incidents",
col = "blue")

# Histogram of fatalities (2000-2014)
hist(airline_safety$fatalities_00_14,
breaks = 20,
xlab = "Number of Fatalities (2000-2014)",
ylab = "Frequency",
main = "Distribution of Fatalities (2000-2014)")

# Sampling distribution of the mean: 1000 samples of size 30
n <- 30
num_samples <- 1000
sample_means <- replicate(num_samples, mean(sample(airline_safety$fatalities_00_14, n)))

hist(sample_means,
breaks = 20,
xlab = "Sample Mean",
ylab = "Frequency",
main = "Distribution of Sample Means (n = 30)")

# Reduce the size of the plot margins
par(mar = c(4, 4, 2, 2))
# Draw a histogram of the data on the density scale so a density curve can be overlaid
hist(airline_safety$fatalities_00_14,
freq = FALSE,
main = "Distribution of Fatalities (2000-2014)",
xlab = "Fatalities")
# Add a normal curve with the same mean and standard deviation as the data
mu <- mean(airline_safety$fatalities_00_14)
sigma <- sd(airline_safety$fatalities_00_14)
curve(dnorm(x, mean = mu, sd = sigma),
col = "red", add = TRUE)

# Reset the plot margins to their default values
par(mar = c(5, 4, 4, 2) + 0.1)

hist(sample_means,
breaks = 20,
xlab = "Sample Mean",
ylab = "Frequency",
main = "Distribution of Sample Means (n = 30)")

# Simple random sample: 50 of the 56 rows, drawn without replacement
set.seed(123)
simple_sample <- airline_safety[sample(nrow(airline_safety), 50), ]
head(simple_sample)
##                  airline avail_seat_km_per_week incidents_85_99
## 31                  KLM*             1874561773               7
## 15      British Airways*             3179760952               4
## 51      Turkish Airlines             1946098294               8
## 14               Avianca              396922563               5
## 3  Aerolineas Argentinas              385803648               6
## 42    Singapore Airlines             2376857805               2
##    fatal_accidents_85_99 fatalities_85_99 incidents_00_14 fatal_accidents_00_14
## 31                     1                3               1                     0
## 15                     0                0               6                     0
## 51                     3               64               8                     2
## 14                     3              323               0                     0
## 3                      0                0               1                     0
## 42                     2                6               2                     1
##    fatalities_00_14
## 31                0
## 15                0
## 51               84
## 14                0
## 3                 0
## 42               83
A simple random sample gives every airline the same chance of selection, so it is representative in expectation, and its accuracy depends on the sample size. Here the sample takes 50 of the 56 rows (about 89% of the data) without replacement, so it covers almost the whole population and the sampling error is small. A much smaller sample would vary more from draw to draw.
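To make the link between sample size and sampling error concrete, the sketch below compares the spread of the sample mean for three sample sizes. It does not use the airline data: `population` is a made-up vector of 56 values, and the helper names `empirical_se` and `theoretical_se` are hypothetical. The theoretical column uses the standard finite-population formula for sampling without replacement.

```r
set.seed(123)

# Hypothetical population of 56 values: many zeros plus a skewed tail
population <- c(rep(0, 30), round(rexp(26, rate = 1 / 120)))
N <- length(population)

# Empirical spread of the sample mean over repeated simple random samples
empirical_se <- function(k, reps = 2000) {
  sd(replicate(reps, mean(sample(population, k))))
}

# Theoretical standard error when sampling without replacement
theoretical_se <- function(k) {
  sd(population) * sqrt((1 - k / N) / k)
}

sizes <- c(5, 20, 50)
result <- data.frame(
  n = sizes,
  empirical = sapply(sizes, empirical_se),
  theoretical = sapply(sizes, theoretical_se)
)
print(result)
```

The spread shrinks quickly as the sample approaches the full 56 values, which is why a sample of 50 leaves little room for sampling error.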
systematic_sample <- airline_safety[seq(1, nrow(airline_safety), 20), ]
head(systematic_sample)
##          airline avail_seat_km_per_week incidents_85_99 fatal_accidents_85_99
## 1     Aer Lingus              320906734               2                     0
## 21      Egyptair              557699891               8                     3
## 41 Saudi Arabian              859673901               7                     2
##    fatalities_85_99 incidents_00_14 fatal_accidents_00_14 fatalities_00_14
## 1                 0               0                     0                0
## 21              282               4                     1               14
## 41              313              11                     0                0
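Systematic sampling is efficient when the rows are already arranged in a list: pick a starting row and then take every k-th row. Here the step is 20 and the start is fixed at row 1, so only three airlines (rows 1, 21, and 41) are selected and the selection involves no randomness. If the ordering of the data has any periodicity that lines up with the step, the sample can be biased.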
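A common variant picks the starting row at random, so that every row has the same chance of selection. The following is a minimal sketch of that idea, not the code used in this analysis. It runs on a made-up data frame `toy` with 56 rows, and the interval `k = 8` is an arbitrary assumption.

```r
# Hypothetical stand-in for the 56-row airline_safety data frame
toy <- data.frame(id = 1:56, airline = paste("Airline", 1:56))

set.seed(123)
k <- 8                                  # sampling interval (assumed)
start <- sample(k, 1)                   # random start between 1 and k
idx <- seq(start, nrow(toy), by = k)    # every k-th row from the start
systematic_sample <- toy[idx, ]
print(systematic_sample$id)
```

Because 56 is a multiple of 8, every possible start yields exactly 7 rows.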
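# Strata are the distinct combinations of the two incident counts.
# Note: n() >= 5 has length one, so ifelse() returns a single index
# and slice() keeps one row per stratum.
stratified_sample <- airline_safety %>%
group_by(incidents_85_99, incidents_00_14) %>%
slice(ifelse(n() >= 5, sample(1:n(), 5), 1:n())) %>%
ungroup()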
head(stratified_sample)
## # A tibble: 6 × 8
##   airline            avail_sea…¹ incid…² fatal…³ fatal…⁴ incid…⁵ fatal…⁶ fatal…⁷
##   <chr>                    <dbl>   <int>   <int>   <int>   <int>   <int>   <int>
## 1 TAP - Air Portugal   619130754       0       0       0       0       0       0
## 2 Hawaiian Airlines    493877795       0       0       0       1       0       0
## 3 Cathay Pacific*     2582459303       0       0       0       2       0       0
## 4 Finnair              506464950       1       0       0       0       0       0
## 5 Austrian Airlines    358239823       1       0       0       1       0       0
## 6 Gulf Air             301379762       1       0       0       3       1     143
## # … with abbreviated variable names ¹avail_seat_km_per_week, ²incidents_85_99,
## #   ³fatal_accidents_85_99, ⁴fatalities_85_99, ⁵incidents_00_14,
## #   ⁶fatal_accidents_00_14, ⁷fatalities_00_14
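Stratified sampling ensures that each subgroup (stratum) of the population is represented, which is useful when the subgroups differ substantially from one another. In this code the strata are the distinct combinations of incidents_85_99 and incidents_00_14, which splits the 56 airlines into many small strata. Because ifelse() returns a single value when its test has length one, slice() keeps at most one row per stratum, so the result is one airline per combination rather than a proportionate sample.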
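For comparison, here is a hedged sketch of a stratified draw that takes up to five rows from each of two coarse strata. It is not the analysis code: the data frame `toy`, its simulated incident counts, and the median cut-off that defines the `stratum` column are all assumptions made for illustration. It relies on dplyr's `slice_sample()`, which silently returns the whole group when a group has fewer rows than requested.

```r
library(dplyr)

set.seed(123)
# Hypothetical stand-in data: 56 airlines with simulated incident counts
toy <- data.frame(
  airline = paste("Airline", 1:56),
  incidents_85_99 = rpois(56, lambda = 5)
)

stratified_sample <- toy %>%
  # Two coarse strata; the median cut-off is an assumption
  mutate(stratum = ifelse(incidents_85_99 <= median(incidents_85_99),
                          "low", "high")) %>%
  group_by(stratum) %>%
  slice_sample(n = 5) %>%   # up to 5 rows per stratum
  ungroup()

count(stratified_sample, stratum)
```

This uses equal allocation (the same number of rows per stratum). A proportionate design would instead pass `prop =` to `slice_sample()` so that each stratum contributes the same fraction of its rows.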
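# Groups airlines by their fatalities_00_14 value and keeps one
# randomly chosen airline for each distinct value
cluster_sample <- airline_safety %>%
group_by(fatalities_00_14) %>%
slice(ifelse(n() > 1, sample(1:n(), 1), 1:n())) %>%
ungroup()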
head(cluster_sample)
## # A tibble: 6 × 8
##   airline             avail_se…¹ incid…² fatal…³ fatal…⁴ incid…⁵ fatal…⁶ fatal…⁷
##   <chr>                    <dbl>   <int>   <int>   <int>   <int>   <int>   <int>
## 1 All Nippon Airways  1841234177       3       1       1       7       0       0
## 2 Philippine Airlines  413007158       7       4      74       2       1       1
## 3 TACA                 259373346       3       1       3       1       1       3
## 4 Air New Zealand*     710174817       3       0       0       5       1       7
## 5 Egyptair             557699891       8       3     282       4       1      14
## 6 Garuda Indonesia     613356665      10       3     260       4       2      22
## # … with abbreviated variable names ¹avail_seat_km_per_week, ²incidents_85_99,
## #   ³fatal_accidents_85_99, ⁴fatalities_85_99, ⁵incidents_00_14,
## #   ⁶fatal_accidents_00_14, ⁷fatalities_00_14
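Cluster sampling is useful when the population is naturally divided into groups, such as geographic regions: whole clusters are selected at random and the units inside them are observed, which reduces the cost of sampling. It can be biased if the selected clusters are not representative of the population. Note that the code above groups airlines by their fatalities_00_14 value and keeps one airline per distinct value, so it does not select whole clusters and is closer to one draw per group than to a cluster sample.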
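The sketch below shows what selecting whole clusters could look like. The dataset has no natural cluster variable, so this example uses a made-up data frame `toy` with a hypothetical `region` column (8 regions of 7 airlines each), and the choice of 3 clusters is arbitrary.

```r
set.seed(123)
# Hypothetical stand-in data with a made-up `region` column as the cluster label
toy <- data.frame(
  airline = paste("Airline", 1:56),
  region  = rep(paste("Region", 1:8), each = 7)
)

# Step 1: draw whole clusters at random
chosen <- sample(unique(toy$region), 3)

# Step 2: keep every airline that belongs to a chosen cluster
cluster_sample <- toy[toy$region %in% chosen, ]
table(cluster_sample$region)
```

Every airline in a chosen region enters the sample, and airlines in the other regions are never observed, which is where both the cost saving and the risk of bias come from.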
# Total seat kilometers per week and incidents (1985-1999) for each airline
airline_data <- airline_safety %>%
group_by(airline) %>%
summarize(total_seat_km = sum(avail_seat_km_per_week),
total_incidents = sum(incidents_85_99))

# Incidents per billion available seat kilometers per week
airline_data <- airline_data %>%
mutate(incident_rate = (total_incidents / total_seat_km) * 1e9)

# Sort from highest to lowest incident rate
airline_data <- airline_data %>%
arrange(desc(incident_rate))

head(airline_data, 10)
## # A tibble: 10 × 4
##    airline                total_seat_km total_incidents incident_rate
##    <chr>                          <dbl>           <int>         <dbl>
##  1 Aeroflot*                 1197672318              76          63.5
##  2 Ethiopian Airlines         488560643              25          51.2
##  3 Pakistan International     348563137               8          23.0
##  4 Xiamen Airlines            430462962               9          20.9
##  5 Philippine Airlines        413007158               7          16.9
##  6 Royal Air Maroc            295705339               5          16.9
##  7 Garuda Indonesia           613356665              10          16.3
##  8 Aerolineas Argentinas      385803648               6          15.6
##  9 China Airlines             813216487              12          14.8
## 10 Egyptair                   557699891               8          14.3
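As a quick check of the formula, the top row of the table can be reproduced by hand. The two inputs below are copied from the Aeroflot* row above. The result is the number of incidents recorded from 1985 to 1999 per billion available seat kilometers per week.

```r
# Values copied from the Aeroflot* row of the table above
total_incidents <- 76
total_seat_km   <- 1197672318   # available seat kilometers per week

incident_rate <- total_incidents / total_seat_km * 1e9
round(incident_rate, 1)   # 63.5
```

Because the denominator is a weekly capacity figure while the numerator covers fifteen years, the rate is best read as a relative measure for comparing airlines rather than as a per-flight risk.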
str(airline_safety)
## 'data.frame': 56 obs. of 8 variables:
## $ airline : chr "Aer Lingus" "Aeroflot*" "Aerolineas Argentinas" "Aeromexico*" ...
## $ avail_seat_km_per_week: num 3.21e+08 1.20e+09 3.86e+08 5.97e+08 1.87e+09 ...
## $ incidents_85_99 : int 2 76 6 3 2 14 2 3 5 7 ...
## $ fatal_accidents_85_99 : int 0 14 0 1 0 4 1 0 0 2 ...
## $ fatalities_85_99 : int 0 128 0 64 0 79 329 0 0 50 ...
## $ incidents_00_14 : int 0 6 1 5 2 6 4 5 5 4 ...
## $ fatal_accidents_00_14 : int 0 1 0 0 0 2 1 1 1 0 ...
## $ fatalities_00_14 : int 0 88 0 0 0 337 158 7 88 0 ...
plot_ly(data = airline_safety, x = ~incidents_85_99, y = ~fatalities_85_99,
color = ~airline, type = "scatter", mode = "markers", hoverinfo = "text",
text = ~paste("Airline: ", airline, "<br>",
"Fatalities (1985-1999): ", fatalities_85_99, "<br>",
"Fatalities (2000-2014): ", fatalities_00_14)) %>%
layout(title = "Airline Safety 1985-1999", xaxis = list(title = "Incidents (1985-1999)"),
yaxis = list(title = "Fatalities (1985-1999)"), hovermode = "closest")

# Interactive scatter plot of incidents against fatalities (2000-2014)
plot_ly(data = airline_safety, x = ~incidents_00_14, y = ~fatalities_00_14,
color = ~airline, type = "scatter", mode = "markers", hoverinfo = "text",
text = ~paste("Airline: ", airline, "<br>",
"Fatalities (1985-1999): ", fatalities_85_99, "<br>",
"Fatalities (2000-2014): ", fatalities_00_14)) %>%
layout(title = "Airline Safety 2000-2014", xaxis = list(title = "Incidents (2000-2014)"),
yaxis = list(title = "Fatalities (2000-2014)"), hovermode = "closest")

The code applies several data visualization and sampling techniques to the “airline_safety” data frame, which contains airline safety records from 1985 to 2014. Here are the conclusions drawn from each visualization and analysis in the code:
Bar plot for fatal_accidents_85_99: The plot shows the number of fatal accidents each airline had between 1985 and 1999. Many airlines had no fatal accidents during this period, but the counts for the rest are not limited to one or two: Aeroflot*, for example, had 14.
Scatter plot for avail_seat_km_per_week and incidents_00_14: The plot shows the relationship between available seat kilometers per week and the number of incidents each airline had between 2000 and 2014. The conclusion we can draw is that there is no clear relationship between these two variables.
Histogram of fatalities_00_14: The plot shows the distribution of the number of fatalities each airline had between 2000 and 2014. The distribution is strongly right-skewed: many airlines had no fatalities during this period, while a small number had counts in the hundreds.
Distribution of Sample Means (n = 30): The plot shows the distribution of the means of 1000 random samples of size 30, each drawn without replacement from the “fatalities_00_14” variable. The distribution of sample means is approximately normal and is centered close to the population mean.
Simple random sample, systematic sample, stratified sample, and cluster sample: The code applies four sampling techniques to the dataset. Each technique has its advantages and disadvantages, and the choice depends on the research question and the available resources.
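Incident Rate by Airline: The code calculates an incident rate for each airline, defined as the number of incidents from 1985 to 1999 per billion available seat kilometers per week. The rate varies widely among airlines: within the top 10 it ranges from 63.5 for Aeroflot* and 51.2 for Ethiopian Airlines down to 14.3 for Egyptair. The dataset has no country or region column, so the ranking by itself does not show whether these differences relate to geography or economic development.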
Scatter plot for incidents_85_99 and fatalities_85_99: The plot shows the relationship between the number of incidents and the number of fatalities each airline had between 1985 and 1999. The conclusion we can draw is that there is a positive correlation between these two variables, and some airlines had a much higher fatality rate than others.
Scatter plot for incidents_00_14 and fatalities_00_14: The plot shows the relationship between the number of incidents and the number of fatalities each airline had between 2000 and 2014. The conclusion we can draw is that there is a positive correlation between these two variables, and some airlines had a much higher fatality rate than others.